In this notebook, we load the data collected from GitHub API v3 (see the GitHub - Repositories from API notebook) and we check, for every repository, whether it has a DESCRIPTION
file at its root. This is the condition we use to identify which repositories store an R package.
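As a reminder, a DESCRIPTION file is a plain-text file of Field: value pairs. Its first lines typically look as follows (the field names are standardized by R; the values below are made up):

Package: mypackage
Title: What the Package Does
Version: 0.1.0
Author: Jane Doe
Description: A short description of the package.
License: MIT + file LICENSE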
In [1]:
import gzip
import json
import requests
import pandas
from collections import OrderedDict
In [2]:
INPUT_FILENAME = '../data/R-apiv3-2015-01-01T00:00:00-2015-06-01T00:00:00.tar.gz'
OUTPUT_FILENAME = '../data/RPackage-Repositories-150101-150601.csv'
# Set of attributes that will be kept for the output
# If a dot is found in the attribute, a nested lookup will be performed
ATTR = [
'full_name',
'name',
'owner.login',
'owner.type',
'created_at',
'description',
'forks_count',
'stargazers_count',
'watchers_count',
'has_downloads',
'has_pages',
'has_issues',
'has_wiki',
]
We make use of IPython's parallel computing capabilities.
To use this notebook, you need either to configure your IPython controller or to start a cluster of IPython engines, for example with ipcluster start -n 4.
See https://ipython.org/ipython-doc/dev/parallel/parallel_process.html for more information.
It seems that most recent versions of IPython Notebook can also start a cluster directly from the web interface, under the Clusters tab.
In [3]:
from IPython import parallel
clients = parallel.Client()
clients.block = False
print 'Clients:', str(clients.ids)
We first load the data and identify the distinct repositories that were gathered using GitHub API v3.
In [4]:
with gzip.GzipFile(filename=INPUT_FILENAME) as gf:
    content = gf.read()
In [5]:
content = json.loads(content)
In [6]:
print '{} items were retrieved from GitHub API v3'.format(len(content))
In [7]:
distinct = {(r['name'], r['owner']['login'], r['full_name']): r for r in content}
print '{} distinct items inside'.format(len(distinct))
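Keying a dictionary on (name, owner login, full name) keeps a single entry per repository: duplicates returned by overlapping queries simply overwrite each other. A toy illustration (hypothetical data):

repos = [{'name': 'pkg', 'owner': {'login': 'jane'}, 'full_name': 'jane/pkg'},
         {'name': 'pkg', 'owner': {'login': 'jane'}, 'full_name': 'jane/pkg'}]
print len({(r['name'], r['owner']['login'], r['full_name']): r for r in repos})
# 1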
We now filter the items to keep only the attributes of interest.
In [8]:
def filter_attributes(item, attributes):
    # Keep only the given attributes; a dotted name such as
    # 'owner.login' triggers a nested lookup (item['owner']['login']).
    new_item = OrderedDict()
    for attr in attributes:
        if '.' in attr:
            attr1, attr2 = attr.split('.')
            new_item['{}.{}'.format(attr1, attr2)] = item[attr1][attr2]
        else:
            new_item[attr] = item[attr]
    return new_item
In [9]:
items = map(lambda r: filter_attributes(r, ATTR), distinct.values())
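For example, on a toy item (hypothetical values), a dotted attribute such as owner.login triggers a nested lookup:

example = {'full_name': 'jane/mypackage', 'owner': {'login': 'jane'}}
print filter_attributes(example, ['full_name', 'owner.login'])
# OrderedDict([('full_name', 'jane/mypackage'), ('owner.login', 'jane')])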
And we're ready to check whether each repository has a DESCRIPTION
file at its root.
In [10]:
def check_item(item):
    # A 200 status code means a DESCRIPTION file exists at the root
    # of the master branch; anything else (typically 404) means it does not.
    url = 'https://raw.githubusercontent.com/{}/master/DESCRIPTION'.format(item['full_name'])
    response = requests.get(url)
    if response.status_code == 200:
        item['package'] = 1
    else:
        item['package'] = 0
    return item
print len(items), 'items to check'
clients[:].execute('import requests')
balanced = clients.load_balanced_view()
res = balanced.map(check_item, items, ordered=False, timeout=15)
import time
while not res.ready():
    time.sleep(5)
    print res.progress, ' ',
In [22]:
# res._result holds the raw per-task results (single-item lists here);
# skip tasks that failed remotely, e.g. because of the 15s timeout.
results = [r[0] for r in res._result if not isinstance(r, parallel.error.RemoteError)]
df = pandas.DataFrame(results).query('package == 1')[ATTR].set_index('full_name')
df.to_csv(OUTPUT_FILENAME, encoding='utf-8')
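If needed, the resulting file can be reloaded later with pandas, e.g. (a minimal sketch):

df = pandas.read_csv(OUTPUT_FILENAME, index_col='full_name', encoding='utf-8')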
In [23]:
print len(df), 'packages found'